Conversation

@hudson-ai (Collaborator) commented Apr 22, 2025

Behavior changes:

  • lazy execution (lm += foo() always returns immediately)
  • execution triggered by stateful access, e.g. str(lm), lm[key], etc. (see the sketch just below)
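
A minimal sketch of the intended flow (foo, bar, and "key" are stand-ins for arbitrary @guidance functions and capture names, not part of this PR's API):

lm = models.OpenAI("gpt-4o-mini", echo=False)
lm += foo()        # returns immediately; nothing has executed yet
lm += bar()        # still lazy -- we're just accumulating the program
text = str(lm)     # stateful access: the accumulated program executes here
value = lm["key"]  # another stateful access; uses the already-executed state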

Introduces:

  • async stateful accessors, e.g. await lm.get_async(key) (API is still a WIP)
  • async guidance functions (i.e. the @guidance decorator on async def functions)
    • allows use of async accessors inside guidance functions
    • as well as other async APIs, semaphores, etc.

Note: async accessors are fully compatible with non-async guidance functions (even stateful ones). I.e. you don't have to rewrite your existing guidance functions as async to get the concurrency benefits of async accessors farther up the stack.

Here's an example usage:

  1. The main logic is encapsulated in a normal (non-async) @guidance function extract_image_data -- it does not need to be aware that its callers may be async!
  2. An async @guidance function get_and_describe_image that uses external async functions, namely the get method of an httpx.AsyncClient.
    • Note that while async accessors are perfectly valid, non-async accessors on the Model object (lm) are disallowed inside of async @guidance functions and will raise an exception. We could probably "fix" this, but it's honestly kind of a nice safeguard against shooting ourselves in the foot.
  3. An async main function that gathers some number of coroutines returned by an async accessor on each of our unevaluated guidance programs.

import httpx
import asyncio
from guidance import *

@guidance
def extract_image_data(lm, image_bytes):
    with user():
        lm += "What is in this image?"
        lm += image(image_bytes)
    with assistant():
        lm += json(
            schema = {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "colors": {"type": "array", "items": {"type": "string"}},
                    "objects": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["description", "colors", "objects"],
                "additionalProperties": False
            },
            name = "data",
        )
    return lm

@guidance
async def get_and_describe_image(lm, client):
    resp = await client.get("https://picsum.photos/200")
    resp.raise_for_status()
    image_bytes = resp.content
    lm += extract_image_data(image_bytes)
    return lm

async def main():
    lm = models.OpenAI("gpt-4o-mini", echo=False)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        lms = [
            lm + get_and_describe_image(client)
            for _ in range(10)
        ]
        datas = await asyncio.gather(*[lm.get_async("data") for lm in lms])
    return datas
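
For completeness, running this from a script would use the standard asyncio entry point:

if __name__ == "__main__":
    datas = asyncio.run(main())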

@guidance functions can also be naively parallelized (regardless of whether or not they are async) via the batched entrypoints:

lms = lm.run_batched([func_1(), ..., func_n()])
lms = await lm.async_run_batched([func_1(), ..., func_n()])

Note that these entrypoints actually run the functions and are not lazy like += is.
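
A sketch of how the image example above might look with the async batched entrypoint (assuming it accepts the same unevaluated guidance calls as += does):

async def batched_main():
    lm = models.OpenAI("gpt-4o-mini", echo=False)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        # Eagerly run all ten programs concurrently (unlike +=, this is not lazy).
        lms = await lm.async_run_batched(
            [get_and_describe_image(client) for _ in range(10)]
        )
    # Everything has already executed, so plain (non-async) accessors should be fine here.
    return [out["data"] for out in lms]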

TODOs:

  • fix stateful capture blocks
  • put token_count on state
  • how to trigger streams?
  • add example usage to this PR
  • fix and un-comment calls to vis/renderer
  • stabilize async accessor api
  • make a decision about the "ambiguous forking" problem
  • do some profiling experiments to ensure we're not introducing unnecessary overhead (e.g. compare to manual thread-based parallelism)
  • documentation

@riedgar-ms (Collaborator) commented:

So the synchronous versions just do a Task.run() (or whatever it is)? Presumably that spins up a short-lived event loop.... I'm guessing we're not concerned about performance implications on that?

@hudson-ai (Collaborator, Author) commented:

> So the synchronous versions just do a Task.run() (or whatever it is)? Presumably that spins up a short-lived event loop.... I'm guessing we're not concerned about performance implications on that?

We're maintaining a single long-lived event loop in a daemon thread (which has its own implications I suppose), so we just submit the coroutine and block the main thread until it's ready.

The nice thing is that this is only happening at the very top-level entry point, so we don't need multiple threads or anything like that to support recursive calls. Getting that working without deadlocks was an interesting exercise -- more than happy to look at that code together!
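
The general pattern is roughly the following (a simplified sketch, not the actual guidance._reentrant_async implementation):

import asyncio
import threading

# One long-lived event loop running forever in a daemon thread.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_sync(coro):
    # Submit the coroutine to the background loop and block the calling
    # thread until its result is available.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()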

@codecov-commenter commented Apr 24, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 68.28087% with 131 lines in your changes missing coverage. Please review.

Project coverage is 55.73%. Comparing base (3918b36) to head (9202113).
Report is 1 commit behind head on main.

Files with missing lines                 Patch %   Missing lines
guidance/_ast.py                          68.37%   37
guidance/models/_base/_model.py           77.86%   27
guidance/models/experimental/_vllm.py     28.57%   20
guidance/_reentrant_async.py              68.29%   13
guidance/models/_openai_base.py           58.33%   10
guidance/models/_azureai.py                0.00%    9
guidance/models/_base/_interpreter.py     88.88%    4
guidance/models/_base/_state.py           50.00%    4
guidance/_guidance.py                     50.00%    3
guidance/library/_gen.py                   0.00%    2
... and 1 more


Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1183       +/-   ##
===========================================
+ Coverage   40.63%   55.73%   +15.10%     
===========================================
  Files          62       63        +1     
  Lines        4782     4972      +190     
===========================================
+ Hits         1943     2771      +828     
+ Misses       2839     2201      -638     


@Harsha-Nori (Member) commented:

Thanks for the push here @hudson-ai -- fantastic work, really.

I've always been a huge fan of Jax's async dispatch model, and want to better understand why benefit 2 will no longer apply in an async dispatch world. Can't we keep e.g. a debounce buffer that batches objects as much as we can, thereby getting most of the benefit anyway?

There might be a solution here that replaces lazy eval with non-blocking eager eval, akin to jax's async dispatch, where we could introduce a block_until_ready method (or just its async counterpart). But I am hesitant, mainly due to lazy execution's benefit no. 2 above.

@hudson-ai (Collaborator, Author) commented:

> Thanks for the push here @hudson-ai -- fantastic work, really.
>
> I've always been a huge fan of Jax's async dispatch model, and want to better understand why benefit 2 will no longer apply in an async dispatch world. Can't we keep e.g. a debounce buffer that batches objects as much as we can, thereby getting most of the benefit anyway?
>
> There might be a solution here that replaces lazy eval with non-blocking eager eval, akin to jax's async dispatch, where we could introduce a block_until_ready method (or just its async counterpart). But I am hesitant, mainly due to lazy execution's benefit no. 2 above.

Thanks @Harsha-Nori! And I appreciate the input / question. I don't honestly know the answer -- maybe some kind of buffering would work. Just going to think out loud a bit...

Let's say we have a chain of lm objects:

lm_1 = lm + foo(name="foo")
lm_2 = lm_1 + bar(name="bar") 
lm_3 = lm_2 + baz(name="baz")

With lazy execution as it's implemented in this PR, nothing gets executed until we do something like lm_3["bar"], at which point we run the chain foo(...) + bar(...) + baz(...). If we try to access an earlier one, e.g. lm_2["bar"], we have to run the chain foo(...) + bar(...), and we may get a different answer.

I'm imagining that if we did async dispatch + eager execution (no buffering), each of lm_1, lm_2, and lm_3 would essentially have a Future under the hood, with the bar part of lm_2 being unable to execute until the foo part of lm_1 does, etc.
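
Very roughly, and with entirely made-up names, I mean something like this:

class EagerModel:
    def __init__(self, parent_future, op):
        async def _run():
            parent_state = await parent_future  # can't start until the parent finishes
            return await op(parent_state)       # then run this node's own piece
        # Scheduled immediately -- this is the "eager dispatch" part.
        self.future = asyncio.ensure_future(_run())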

With debounce-style buffering, we could track parent-child relationships, and noting that lm_1 and lm_2 both have children, we wouldn't run anything for them at all, only computing lm_3's foo(...) + bar(...) + baz(...). But we'd then have to somehow back-fill lm_1 and lm_2, e.g. in case someone tries to access lm_1["foo"].

This doesn't seem too bad, but I think the story gets far more complicated once we start having branching calls / arbitrary DAGs.

E.g.

for _ in range(100):
    lm += qux()

lm_1 = lm
for _ in range(100):
    lm_1 += foo()

lm_2 = lm
for _ in range(100):
    lm_2 += bar()

lm_1 and lm_2 share a common ancestor, namely lm with its 100 quxes. What if both of them start trying to compute their chains (qux() + ... + qux() + foo() + ... + foo() and qux() + ... + qux() + bar() + ... + bar(), respectively)? Do they have to compete to acquire a lock on lm to make sure only one value gets computed for qux() + ... + qux()? If so, that means we can't parallelize the foos and the bars. For non-trivial DAGs, this means we probably miss a ton of speedup opportunities for things that should be embarrassingly parallel.

If we can figure out the right way to do this "back-filling", I kind of like the idea. But it's also a bit spooky... Thoughts?

@hudson-ai (Collaborator, Author) commented:

Some kind of lm.run() is a lot less magic and, in a lot of ways, a lot more cumbersome (e.g. having to call run() before every getitem, lest an exception be raised). But it's another approach to removing the ambiguities and keeping everything immutable.
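
Concretely, that would be something like this (hypothetical API, not in this PR):

lm += foo(name="foo")
lm = lm.run()        # explicit, eager execution
value = lm["foo"]    # fine now; without run(), this would raise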

@nopdive I know you're a fan of async dispatch. Any thoughts on your end?

@hudson-ai (Collaborator, Author) commented:

Notes / status update for anyone watching this --

  • Everything works, but the remaining sticking points are mostly around the API.
  • I'm currently working on the "backfilling" discussed above in order to get rid of the ambiguity that comes with "forking". @nopdive and I outlined a version of that together that I think has acceptable ergonomics.
  • I'm leaning towards eliminating the get_async function and its siblings in favor of something that feels more like async dispatch, i.e. await lm.block_until_ready() or something of the sort. But deciding this can wait until the backfilling work is done.

hudson-ai marked this pull request as a draft on July 25, 2025.